Project 4 - Unsupervised Learning

Author - Shekar Roy

Part 1/5

DOMAIN:

Automobile

CONTEXT:

The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes

DATA DESCRIPTION:

The data concerns city-cycle fuel consumption in miles per gallon

Attribute Information:

1- mpg: continuous
2- cylinders(cyl): multi-valued discrete
3- displacement(disp): continuous
4- horsepower(hp): continuous
5- weight(wt): continuous
6- acceleration(acc): continuous
7- model year(yr): multi-valued discrete
8- origin: multi-valued discrete
9- car name: string (unique for each instance)

PROJECT OBJECTIVE:

The goal is to cluster the data, treat each cluster as an individual dataset, and train regression models on each to predict ‘mpg’

There are various ways to handle missing values: drop the rows, replace missing values with the median, etc. Of the 398 rows, 6 have NaN in the hp column. We could drop those 6 rows, but that is not a good idea in all situations. Here, we will replace them with the median: first replace '?' with NaN, and then replace NaN with the column median.
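The two-step imputation described above can be sketched as follows; the `hp` column name follows the attribute list, while the toy values are illustrative only:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the hp column: numbers read as strings, with '?'
# marking the missing entries (as in the raw auto-mpg file).
df = pd.DataFrame({"hp": ["130", "165", "?", "150", "?", "140"]})

# Step 1: '?' placeholders become NaN and the column becomes numeric.
df["hp"] = pd.to_numeric(df["hp"].replace("?", np.nan))

# Step 2: NaNs are filled with the column median.
df["hp"] = df["hp"].fillna(df["hp"].median())
```

After these two steps the column is fully numeric with no missing values.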

Observation from the above graph: American 4-cylinder cars produced in 1973 with a medium mpg level seem to dominate the dataset

Displacement and Horsepower seem to be skewed to the right

There appears to be a linear relationship between the variables

Except for year, most of the variables are correlated with each other

Hierarchical clustering

At first glance the dendrogram appears to be visual clutter. However, seen from the top, there are 2 probable clusters. We will have to analyze it more to confirm that, so we'll now go ahead and cut the dendrogram to give us 2 clusters/groups
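Cutting the dendrogram into 2 groups can be sketched as below; random two-blob data stands in for the scaled mpg features, and Ward linkage is assumed:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in data: two well-separated groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(6, 1, (20, 4))])

# Same linkage matrix that would be used to draw the dendrogram.
Z = linkage(X, method="ward")

# Cut the tree so that exactly 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
```

`fcluster` with `criterion="maxclust"` is the programmatic equivalent of drawing a horizontal cut across the dendrogram at the height that yields two groups.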

Clearly shows two distinct groups, with a clear difference in the per-cluster averages of the variables

K-Means Clustering

K-Means likewise shows two distinct groups, with a clear difference in the per-cluster averages of the variables
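The K-Means fit and the per-cluster average comparison can be sketched as below; random two-blob data stands in for the scaled mpg features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: two well-separated groups of points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (25, 4)), rng.normal(5, 1, (25, 4))])

# Fit K-Means with the 2 clusters suggested by the dendrogram.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Per-cluster means show the separation between the two groups.
means = [X[labels == k].mean(axis=0) for k in range(2)]
```

Comparing `means` column by column is what reveals the "difference in average between the clusters" noted above.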

Observation time!

Mention how many optimal clusters are present in the data

What could be the possible reason behind it?

How using different models for different clusters will be helpful in this case and how it will be different than using one single model without clustering? Mention how it impacts performance and prediction.

ML Model Building

Linear regression on the original dataset

Linear regression on data with K means cluster

Linear regression on data with H-clusters
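The benefit of per-cluster regressors over one global model can be illustrated with a hypothetical sketch: two clusters whose mpg relationships have different slopes, so a separate linear model per cluster fits each regime far better than a single model fitted to everything:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: cluster 1 follows y = 3x, cluster 2 follows y = 5 - 2x.
rng = np.random.default_rng(2)
X1 = rng.uniform(0, 1, (50, 1)); y1 = 3 * X1.ravel() + rng.normal(0, 0.05, 50)
X2 = rng.uniform(0, 1, (50, 1)); y2 = 5 - 2 * X2.ravel() + rng.normal(0, 0.05, 50)

# One global model fitted on the pooled data.
X = np.vstack([X1, X2]); y = np.concatenate([y1, y2])
global_r2 = LinearRegression().fit(X, y).score(X, y)

# One model per cluster.
cluster_r2 = [LinearRegression().fit(Xc, yc).score(Xc, yc)
              for Xc, yc in [(X1, y1), (X2, y2)]]
```

The global model averages the two opposing trends and scores poorly, while each cluster-specific model captures its own trend almost perfectly; this is the intuition behind training one regressor per cluster.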

Improvisation/Suggestions

Part 2/5

DOMAIN:

Manufacturing

CONTEXT:

Company X curates and packages wine across various vineyards spread throughout the country.

DATA DESCRIPTION:

The data concerns the chemical composition of the wine and its respective quality. Attribute Information:

A, B, C, D: specific chemical composition measure of the wine
Quality: quality of wine [ Low and High ]

PROJECT OBJECTIVE:

Goal is to build a synthetic data generation model using the existing data provided by the company.

The chemical compositions are on the same scale, between 0 and 200

There appears to be no misclassification when the predicted clusters are checked against the non-missing target values; hence the new labels can be used as the target variable
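One common choice for a synthetic-data generation model of this kind is a Gaussian mixture: fit it to the real measurements, then sample new rows from the fitted density. The sketch below assumes that approach; random two-component data stands in for the A–D chemical measures (which in the real data lie on a 0–200 scale):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for the four chemical measures: two groups of compositions.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(50, 5, (100, 4)), rng.normal(150, 5, (100, 4))])

# Fit a 2-component Gaussian mixture to the real data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Draw 50 synthetic rows, together with the component each row came from.
X_new, comp = gmm.sample(50)
```

The sampled component indices play the same role as the predicted cluster labels mentioned above, so the synthetic rows come with a quality-like label attached.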

Part 3/5

DOMAIN:

Automobile

CONTEXT:

DATA DESCRIPTION:

PROJECT OBJECTIVE:

Data

There appear to be quite a few missing values; we will replace them with the median for now

Columns have data distributed across multiple scales. Several columns have distributions that are not unimodal (e.g., distance_circularity, hollows_ratio, elongatedness). Columns skewness_about and skewness_about 1 have right-skewed data, whereas column skewness_about 2 is nearly normally distributed

Some columns have a long right tail (e.g., pr.axis_aspect_ratio); as evident from the distplot above, it is highly likely that they contain outliers

The boxplots above reveal that there are outliers in 8 different columns; we will treat them eventually

There is a significant difference between the classes in both the mean and the median of all the numeric attributes

outlier treatment!

much better!!
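The document does not name the outlier-treatment method, so the sketch below assumes one common approach, IQR-based capping: values beyond 1.5 × IQR from the quartiles are clipped to the fence. A toy column with one obvious outlier stands in for e.g. pr.axis_aspect_ratio:

```python
import pandas as pd

# Toy column: mostly values around 10-12, plus one extreme outlier (90).
s = pd.Series([10, 11, 12, 11, 10, 12, 11, 90])

# Tukey fences: 1.5 * IQR beyond the first and third quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are pulled in rather than dropped,
# so no rows are lost.
capped = s.clip(lower=lo, upper=hi)
```

Capping (rather than dropping rows) keeps the sample size intact, which matters for a dataset of this size.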

We perform dimensionality reduction before fitting an SVM, and will compare the two models once PCA is complete.

PCA/ Dimensionality Reduction

The plot above shows that the first six components explain more than 95% of the variation, while the first five capture more than 91% of the information. We can therefore drop the 7th component onwards.
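Selecting the component count from the cumulative explained-variance curve can be sketched as below; random low-rank data stands in for the 18 scaled vehicle features:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 18 observed features driven by ~5 latent factors,
# plus a small amount of noise.
rng = np.random.default_rng(4)
Z = rng.normal(size=(200, 5))
X = Z @ rng.normal(size=(5, 18)) + 0.01 * rng.normal(size=(200, 18))

# Fit PCA on all components and accumulate the explained variance ratio.
pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching the 95% threshold.
n_keep = int(np.argmax(cum >= 0.95)) + 1
```

`explained_variance_ratio_` is what the scree/cumulative plot above is built from; `n_keep` is the cut-off applied when dropping the later components.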

SVM on Original Dataset

SVM on PCA Dataset
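The two fits being compared can be sketched with pipelines; synthetic classification data stands in for the vehicle dataset, and the 6-component cut-off follows the PCA observation above:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: 18 features, only a handful of which are informative.
X, y = make_classification(n_samples=400, n_features=18, n_informative=6,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# SVM on the original (scaled) features.
svm_raw = make_pipeline(StandardScaler(), SVC()).fit(Xtr, ytr)

# SVM on the first 6 principal components.
svm_pca = make_pipeline(StandardScaler(), PCA(n_components=6),
                        SVC()).fit(Xtr, ytr)

acc_raw, acc_pca = svm_raw.score(Xte, yte), svm_pca.score(Xte, yte)
```

Wrapping scaling and PCA inside the pipeline ensures both are fitted only on the training split, avoiding leakage into the test accuracy.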

Summary/Conclusion

Both models give more than 90% accuracy on the test data. However, the PCA model used only 6 components to reach 90%+ accuracy, whereas the model on the original data used all 18 columns to do the same. The difference would be even more pronounced on a dataset that truly suffers from the curse of dimensionality.

Part 4/5

DOMAIN:

Sports management

CONTEXT:

Company X is a sports management company for international cricket.

DATA DESCRIPTION:

The data collected belongs to batsmen from the IPL series conducted so far. Attribute Information:

Runs: runs scored by the batsman
Ave: Average runs scored by the batsman per match
SR: strike rate of the batsman
Fours: number of boundary/four scored
Six: number of boundary/six scored
HF: number of half centuries scored so far

PROJECT OBJECTIVE:

Goal is to build a data driven batsman ranking model for the sports management company to make business decisions.

Strike rate, fours, sixes and half centuries have a skewed distribution

There appear to be outliers in SR, Sixes, Fours, and HF; we will not be treating them, as it is highly likely that these are genuine observations and are definitely player-dependent.

Most variable pairs are highly correlated, except fours with strike rate, strike rate with half centuries, and strike rate with runs

The bend in the elbow plot above is a clear indication that we can segregate players into 2 categories/groups
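The elbow computation behind that plot can be sketched as below; stand-in data with two clear groups replaces the scaled batsman statistics:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: two well-separated groups of players.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(8, 1, (30, 3))])

# Inertia (within-cluster sum of squares) for k = 1..6; plotting this
# against k produces the elbow curve, with the bend at k = 2.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
```

The steep drop from k = 1 to k = 2, followed by only marginal gains, is what the "elbow bend" refers to.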

Part 5/5

Question - List of all Possible Dimensionality Techniques and when to use them

Dimensionality reduction techniques can be classified into three types; please find them below.

Feature selection: keeps a subset of the original features (e.g., filter methods based on correlation, wrapper methods such as recursive feature elimination, embedded methods such as Lasso); use when interpretability of the original features matters.

Components / Factor Based: derives new components from linear combinations of the original features (e.g., PCA, Factor Analysis, LDA); use when features are correlated and the variance can be compressed into fewer dimensions.

Projection Based: projects the data into a lower-dimensional space, often non-linearly (e.g., t-SNE, UMAP, Isomap); use for visualization and for data lying on a non-linear manifold.

Question -

So far you have used dimensional reduction on numeric data. Is it possible to do the same on a multimedia data [images and video] and text data ? Please illustrate your findings using a simple implementation on python.

The projected data is now two-dimensional

Another way to gain intuition into the characteristics of the model is to plot the inputs again with their predicted labels. We will use blue for correct labels and red for incorrect labels.
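A minimal illustration of dimensionality reduction on image data, assuming the standard scikit-learn digits set and Isomap as the projection (each 8×8 digit image is a 64-dimensional vector projected down to two dimensions):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

# Each digit image is flattened into a 64-dimensional feature vector.
digits = load_digits()

# Non-linear projection of the first 500 images down to 2 dimensions;
# the 2-D result is what gets scattered and colored by predicted label.
proj = Isomap(n_components=2).fit_transform(digits.data[:500])
```

The same pattern extends to text by first vectorizing documents (e.g., TF-IDF) and then applying a projection such as TruncatedSVD.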